-
Notifications
You must be signed in to change notification settings - Fork 325
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[WIP] Add Top-K Nearest Neighbors for Normalized Matrix Profile #592 #595
Conversation
@seanlaw |
@seanlaw For |
If possible, we should try to account for the indices in the naive implementation. For
No, we should avoid this for this PR. This should only be a last resort as it will hide other issues. We will keep it in mind but 99% of the time, this is not needed. Btw, since the input arrays in the unit tests are random, I will usually run them 1,000+ times on my machine for new features/tests (especially if there is a big change in the test) before I push the commit. So I don't just run it once and assume that it's fine. |
For
Now that we use
Correct! If there is no tie, the option Please correct me if I am wrong... |
Note that
No, not quite. Currently, I believe that |
I did not know that! Interesting.... Thanks for the info.
It is supposed to be guaranteed (I think because they have the similar traversal method), but it is not:
But, the indices of |
Also, note that way we traverse matters... If we go row-by-row, the 5-th distance profile is: As you said, |
No, traversing diagonally should solve all of your problems. If you store a distance and later find another distance that is the same, you do not replace the first index no matter what that index is. You only replace the index if there is a smaller distance. The order in which the distances are encountered is key. This should still hold true for any |
I am trying to clarify things and catch up with you. So, if you do not mind, please allow me do a quick recap to make sure I am on the same page as you are. (1) We would like to believe that existing (2) For now, let us assume (1) is correct and everything is 100% OK. If we change the Therefore:
Exactly! I have the same thought. If we plan to have diagonal traversal in distance matrix in the function EVEN IF we traverse diagonally in both (Btw, sorry for making you tired on this.) |
Yes, it's less a belief and more an expectation. What are the counter examples? (2) For now, let us assume (1) is correct and everything is 100% OK. If we change the naive.stump to traverse matrix row by row. (which is different than stumpy.stump that traverses diagonally), we would like to believe that they give the same P and I regardless of their traversal method. NOTE: after the investigation, I realized that I was wrong and different traversal methods give different matrix profile indices. So, I don't expect row-wise traversal would give the same result as diagonal-wise traversal. In fact, IIRC, I updated
Why? How can this be possible? |
Btw, you may want to add a
|
Btw, note that |
I am providing one example below. After further investigation, I think I found the cause of such difference. It seems the discrepancy between
What do you think? Should this (small / rare ?) issue be taken care of? As you correctly said, "it's less a belief and more an expectation." Now, I understand that we should expect outputs to be exactly the same as they follow similar process (The only difference is that
Got it. Thanks for the clarification.
Cool! Thanks for the suggestion! So, if I understand correctly, for
I did not take a look at that. Thanks for letting me know about this. |
No, I am less concerned about this example
That sounds reasonable. |
@seanlaw
If I understand correctly, first I should take care of Then, I need to work on naive version of |
Yes, this is correct. Not only should it pass the unit test for
Hmmm, so |
Correct! All existing unit tests and the new one(s) for
So, our goal is adding parameter
Did you mean ALL test functions in |
Yes! And, eventually, to
So, I was thinking that we would update Does that make sense? If not, please feel free to ask more clarifying questions! Your questions are helping me think through things as well as it has been a while since I've had to ponder about this. |
Got it. We do not touch
It totally makes sense. Should we take care of this in parallel with this PR? like...at the end before finalizing everything? |
@seanlaw |
Yes, please proceed! |
@seanlaw |
Should I review the recent commits or shall I wait for the stump.py changes? |
Please wait for the the |
@seanlaw
So, if
I also did a quick search and it seems basic indexing creates a view rather than a copy, and it can be done in constant time. So, |
Because of such behavior, and also because top-k
Alternatively, we can store top-k smallest The modified function stump.py, which accepts parameter |
Yes, let's do that. Considering that At the end of the day:
How does that sound? Any questions, comments, or concerns? |
I will pull the latest changes to this branch.
Right! I was mixing things. As you said, the main goal is to (i) have a very little overhead for So, I need to do three things here:
|
Actually, I might even prefer simply:
So, you'd have to roll it back quite a fair bit to when I had already given the "okay" for a previous version. This way, I don't have to revisit all of the files when you add all of the subsequent changes in a later PR. Does that make sense? Probably somewhere before this point. Though, you still had |
I can do that. Just to make sure we are on the same page: I will undo the changes to get to this point. I assume you, for now, are "okay" with the overhead it added to the computation. And, we want to tackle the performance later in another PR. (btw, the least thing we can do is to just avoid using splitted |
c9a2f8d
to
428ef8c
Compare
@seanlaw
Please let me know what you think :) |
Thank you @NimaSarajpoor! I will find some time to review it |
@NimaSarajpoor I think everything looks good. Would you mind doing a performance comparison again just to confirm that we have reached the same state as before? Once we confirm this then I think we can merge. Btw, before we re-focus on the performance of stump* top-k, I think we should add focus on aamp* top-k first and complete all of that. This way, we can at least be at a functional state that is consistent everywhere and performance can come later. What do you think? |
I think that is a reasonable approach. Let's add support for top-k in both normalized and non-normalized first.
Will do so :) |
In below, I am providing the performance of
stump
NOTE: top-k performs better for stumped
NOTE: top-k performs better for gpu_stumpLATER! SEE the BOTTOM of THIS COMMENT prescrump
I think the computing time of scrump (with
|
scrump_prescrump==False (m=50) | n=1000 | n=10_000 | n=20_000 | n=50_000 | n=100_000 | n=200_000 |
---|---|---|---|---|---|---|
main | 0.0076 | 0.056 | 0.114 | 0.354 | 0.97 | 3.288 |
top-k (k==1) | 0.016 | 0.1129 | 0.23 | 0.61 | 1.42 | 3.835 |
top-k (k==10) | 0.021 | 0.382 | 0.986 | 3.19 | - | - |
stumpi (with egress=True
)
note: applied .update
once
stumpi_engress==True (m=50) | n=1000 | n=10_000 | n=20_000 | n=50_000 | n=100_000 | n=200_000 |
---|---|---|---|---|---|---|
main | 0.009 | 0.134 | 0.405 | 3.072 | 18.29 | 96.19 |
top-k (k==1) | 0.016 | 0.19 | 0.631 | 2.64 | 14.28 | 77.42 |
top-k (k==10) | 0.044 | 0.667 | 2.29 | 15.54 | - | - |
Again, note that top-k
performs better for n=50_000
, n=100_000
, and n=200_000
. (You may want to check it in your pc to see if my conclusion is correct so far or not)
gpu_stump
gpu_stump (m=50) | n=1000 | n=10_000 | n=20_000 | n=50_000 | n=100_000 | n=200_000 |
---|---|---|---|---|---|---|
main | 0.3 | 2.12 | 4.2 | 10.47 | 21.92 | 50.56 |
top-k (k==1) | 0.37 | 2.92 | 5.59 | 13.92 | 26.95 | 60.02 |
top-k (k==10) | 0.38 | 2.83 | 5.57 | 14.57 | - | - |
Personally, I do not like the performance of top-k gpu_stump
(notice that its computing time is about 20% greater than main
for n=100_000
and n=200_000
). So, I locally (on my local branch in my pc) revised this function and used one array n=100_000
and n=200_000
then become around 22 and 47, respectively! (very close to main
).
If you think we should revise gpu_stump
, I can push its commit. If you think it is okay, please feel free to merge it :)
@NimaSarajpoor Almost there. It looks like we aren't hitting some lines of code according to coverage tests:
It looks like we don't actually have any code that tests for when Thus, it's important to not only look for tests that pass/fail but also that we are achieving 100% coverage (without using |
I am trying to understand why I missed this! I thought the result of coverage should show up here. Right? But, I cannot find it now. After your comment, I checked out the Github Action log and found it there. But, I should have seen it here on this PR page as well. right?
I just took a look and noticed that there was a typo in test function |
@seanlaw |
I'm not sure. Sometimes I see it but I don't see it here either. 🤷♂️ Anyhow, no worries and thank you for addressing it.
@NimaSarajpoor Thank you for this wonderful contribution! Your dedication will be greatly appreciated by the matrix profile community! There is still a bit of follow up work but I appreciate all of your time and efforts. |
Thank you for allowing me to take this opportunity. It has been a wonderful journey. I learned a lot :) |
This PR addresses issue #592. In this PR, we want to extend the function
stump
and the related ones so that it returns Top-K Nearest Neighbors Matrix Profile (i.e. the k smallest values for each distance profile and their corresponding indices).What I have done so far:
stump_topk
stump
Notes:
(1)
np.searchsort()
is used in thenaive.stump_topk
. Another naive approach can be traversing distance matrix row-by-row, and usingnp.argpartition()
followed by `np.argsort()'.(2) I think considering the first k columns in the output of
stump_topk
is cleaner, and also easier for the user to get access to those top-k values.I can/will think about a clean way to change the structure of output just before returning it so that the first four columns becomes the same for all k. However, I think that makes it hard to use the output later in other modules.
Add To-Do List
k
tostump
k
tostumped
k
togpu_stump
k
toscrump
k
tostumpi
Side notes:
k
to non-normalized versions.P
as input. For instance, instumpy.motifs
, we may say something like... "P
is (top-1) 1D array matrix profile"